Combining Methods to Create Synthetic Microdata: Quantile Regression, Hot Deck, and Rank Swapping
Authors
Abstract
Government agencies must simultaneously disseminate useful microdata and maintain confidentiality of individual records. Releasing synthetic data is one approach. We propose to create synthetic data using a combination of quantile regression, hot deck imputation, and rank swapping. The result is a releasable data set containing original values for a few key variables, synthetic quantile regression predictions for several variables, and imputed and perturbed values for the remaining variables. The procedure should provide quality data to the user and simultaneously protect the confidentiality of respondents. The method is illustrated by creating synthetic data for a Public Use Microdata Set from the American Community Survey.
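The abstract names hot deck imputation and rank swapping without giving their details, so the following is only a minimal sketch of generic textbook versions of those two techniques, not the authors' procedure. The function names `rank_swap` and `hot_deck`, the `window_pct` parameter, and the use of `None` to mark values to be imputed are all assumptions made for illustration. Quantile regression is omitted here because it would need a fitting library.

```python
import random


def rank_swap(values, window_pct=10.0, rng=None):
    """Swap each value with a partner whose rank is within window_pct percent.

    Hypothetical helper: a generic rank-swapping sketch, not the paper's code.
    """
    rng = rng or random.Random()
    n = len(values)
    # Record indices ordered by value, i.e. position r holds the rank-r record.
    order = sorted(range(n), key=lambda i: values[i])
    max_gap = max(1, int(n * window_pct / 100))
    swapped = list(values)
    used = [False] * n  # rank positions that have already been swapped
    for r in range(n):
        if used[r]:
            continue
        # Candidate partners: unswapped ranks above r, inside the window.
        upper = min(n, r + max_gap + 1)
        candidates = [s for s in range(r + 1, upper) if not used[s]]
        if not candidates:
            continue  # no partner available, value stays in place
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = values[j], values[i]
        used[r] = used[s] = True
    return swapped


def hot_deck(records, target, cell_keys, rng=None):
    """Fill missing (None) target values from a random donor in the same cell.

    Hypothetical helper: cells are defined by the values of cell_keys.
    """
    rng = rng or random.Random()
    donors = {}
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[k] for k in cell_keys)
            donors.setdefault(cell, []).append(rec[target])
    filled = []
    for rec in records:
        new = dict(rec)  # copy so the input records are not modified
        if new[target] is None:
            pool = donors.get(tuple(new[k] for k in cell_keys))
            if pool:  # leave the gap when the cell has no donor
                new[target] = rng.choice(pool)
        filled.append(new)
    return filled
```

Rank swapping as sketched keeps the marginal distribution of the variable exactly, since values are only exchanged between records, while the window size limits how far any value can move in rank. The hot deck sketch draws donors only from records that share the same cell, which is one common way to preserve relationships with the cell-defining variables.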
Similar resources
Measuring Disclosure Risk for a Synthetic Data Set Created Using Multiple Methods
Government agencies must simultaneously maintain confidentiality of individual records and disseminate useful microdata. We propose a method to create synthetic data that combines quantile regression, hot deck imputation, and rank swapping. The result from implementation of the proposed procedure is a releasable data set containing original values for a few key variables, synthetic quantile reg...
LHS-Based Hybrid Microdata vs Rank Swapping and Microaggregation for Numeric Microdata Protection
In previous work by Domingo-Ferrer et al., rank swapping and multivariate microaggregation have been identified as well-performing masking methods for microdata protection. Recently, Dandekar et al. proposed using synthetic microdata, as an option, in place of original data by using the Latin hypercube sampling (LHS) technique. The LHS method focuses on mimicking univariate as well as multivariate s...
Re-identification Methods for Masked Microdata
Statistical agencies often mask (or distort) microdata in public-use files so that the confidentiality of information associated with individual entities is preserved. The intent of many of the masking methods is to cause only minor distortions in some of the distributions of the data and possibly no distortion in a few aggregate or marginal statistics. In record linkage (as in nearest neighbor ...
The Impact of Alternative Imputation Methods on the Measurement of Income and Wealth: Evidence from the Spanish Survey of Household Finances
The goal of this paper is to emphasise the importance of the way of handling missing data and its impact on the outcome of empirical studies. Using the 2002 wave of the Spanish Survey of Household Finances (EFF), I study the performance of alternative methods: listwise deletion, non-stochastic, multiple and single imputation based on linear-regression models, and hot-deck procedures. Using desc...
Community-Wide Health Risk Assessment Using Geographically Resolved Demographic Data: A Synthetic Population Approach
BACKGROUND: Evaluating environmental health risks in communities requires models characterizing geographic and demographic patterns of exposure to multiple stressors. These exposure models can be constructed from multivariable regression analyses using individual-level predictors (microdata), but these microdata are not typically available with sufficient geographic resolution for community risk...